Preamble

Load the Packages

First load in the packages that will be required:

if(!require('pacman')){
    install.packages('pacman')
    library('pacman')
  }
## Loading required package: pacman
  pacman::p_load(ggmap, plotly, EnvStats, ggplot2, GGally, corrplot, dplyr, tidyr, stringr, reshape2, cowplot, ggpubr, rmarkdown, rstudioapi, wesanderson, RColorBrewer, colorspace)

Load in the Dataset

Load the data set.

# Load Dataset ------------------------------------------------------------

all.df <- read.csv(file = "practical06a.csv", header = TRUE, sep = ",")
#all.df <- read.csv(file = file.choose(), header = TRUE, sep = ",")


# Convert the Output to a categorical variable (factor) -------------------
all.df$y <- as.factor(all.df$y)
head(all.df)
##   X       x1       x2 y
## 1 1 23.43869 20.09662 0
## 2 2 24.05539 17.37254 0
## 3 3 24.46357 16.14766 0
## 4 4 19.00521 14.08636 0
## 5 5 14.44262 17.69164 0
## 6 6 15.94049 11.98209 0

Plot the Data

Where possible it can be really useful to visualise data.

Base Plot

Owing to the need to colour the points by group, it can be more difficult to use a base plot in this case:

PlotCol.vec <- rainbow_hcl(2)

plot(x2 ~ x1, col = PlotCol.vec[all.df$y], data = all.df, cex = 2,
     xlab = "Predictor 1",
     ylab = "Predictor 2",
     main = "Categorised Variables" )

Prettier ggplot2

A better looking plot can be created in ggplot2:

# Plot the Data -----------------------------------------------------------
  ##Outcome by colour
col.plot <- ggplot(all.df, aes(y = x2, x = x1, col = y)) +
  geom_point() + 
  theme_classic() +
  labs(x = "Predictor 1", y = "Predictor 2",
       title = "Discrete Outcome From Predictor Values",
       col = "Output")

col.plot

Side by side plot

Each predictor can also be plotted against the output separately:

  ##Side by Side
outvx1.plot <- ggplot(all.df, aes(y = y, x = x1, col = y)) +
  geom_point() + 
  theme_classic() +
  labs(x = "Predictor 1", y = "Outcome",
       title = "Discrete Outcome From Predictor Values")

outvx2.plot <- ggplot(all.df, aes(y = y, x = x2, col = y)) +
  geom_point() + 
  theme_classic() +
  labs(x = "Predictor 2", y = "Outcome",
       title = "Discrete Outcome From Predictor Values") 


ggarrange(outvx1.plot, outvx2.plot)

Create a Logistic Regression Model

Convert the discrete output variable to a factor

In machine learning discrete variables are known as categorical variables and the dependent variable of a model is known as the output variable.

Before fitting a logistic regression model it is necessary to ensure that the output variable is stored as a factor, not as numeric or integer, otherwise the model may perform suboptimally.

For the sake of plotting, consider adding another column to the data frame so that the output variable is available both as an integer and as a factor; this can be useful when working with plots.
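For example, a second column could hold the integer form before the conversion (the column name y.int and the toy data frame here are illustrative, since the practical's CSV is not reproduced):

```r
# Toy data frame standing in for the practical's all.df
toy.df <- data.frame(x1 = c(1.2, 3.4), x2 = c(5.6, 7.8), y = c(0, 1))

# Keep the integer form in a separate column before converting y to a factor
toy.df$y.int <- as.integer(toy.df$y)
toy.df$y     <- as.factor(toy.df$y)

str(toy.df)  # y is now a factor; y.int remains integer
```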

# Convert the Output to a categorical variable (factor) -------------------
all.df$y <- as.factor(all.df$y)
head(all.df)
##   X       x1       x2 y
## 1 1 23.43869 20.09662 0
## 2 2 24.05539 17.37254 0
## 3 3 24.46357 16.14766 0
## 4 4 19.00521 14.08636 0
## 5 5 14.44262 17.69164 0
## 6 6 15.94049 11.98209 0

Create the Regression Model

A logistic regression model can be fitted using the glm() function:

log.mod <- glm(y ~ x1 + x2,
               data   = all.df,
               family = "binomial")

The family parameter specifies the type of model being built, because the glm() function can be used to fit many different types of regression; in this case family = "binomial" tells R to perform a logistic regression.

Model Coefficients

The model coefficients can be returned as follows:

log.mod$coefficients
## (Intercept)          x1          x2 
## -33.2287692   0.4351529   1.1788718

Thus the model is:

\[\hat{p} = \frac{1}{1+e^{-(-33.2 + 0.44x_1 + 1.18x_2)}}\]

Making Predictions

By default the values returned by predict() are on the 'log-odds' (link) scale; the predicted probabilities can be returned by specifying type = "response" within the predict() function.
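As an illustration, using simulated data (since the practical's CSV is not reproduced here), the two scales are related through the inverse-logit function plogis():

```r
# Simulated stand-in data; the column names mirror the practical's data frame
set.seed(1)
sim.df <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
sim.df$y <- rbinom(100, 1, plogis(-1 + 0.5 * sim.df$x1 + sim.df$x2))

sim.mod <- glm(y ~ x1 + x2, data = sim.df, family = "binomial")

log.odds <- predict(sim.mod)                     # default: link (log-odds) scale
probs    <- predict(sim.mod, type = "response")  # probability scale

# The probability is the inverse-logit of the log-odds
all.equal(unname(probs), unname(plogis(log.odds)))
```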

Plot the Decision Boundary

The decision boundary is a straight line whose coefficients can be derived from the model.

Solve for the Line

In this case the intercept and slope must be expressed relative to the x2 coefficient, because x2 is represented on the y-axis:
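The line comes from setting the log-odds to zero (the \(p = 0.5\) contour) and solving for \(x_2\):

\[\beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0 \quad\Rightarrow\quad x_2 = -\frac{\beta_0}{\beta_2} - \frac{\beta_1}{\beta_2}x_1\]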

 ##The decision boundary is the line:
intercept <- (-1)*log.mod$coefficients["(Intercept)"]/log.mod$coefficients["x2"]
slope     <- (-1)*log.mod$coefficients["x1"]/log.mod$coefficients["x2"]

Base Plot

Using base packages the decision boundary can be plotted

PlotCol.vec <- rainbow_hcl(3)

plot(x2 ~ x1, col = PlotCol.vec[all.df$y], data = all.df, cex = 2,
     xlab = "Predictor 1",
     ylab = "Predictor 2",
     main = "Categorised Variables" )
abline(intercept, slope, col = PlotCol.vec[3])

ggplot2

Using ggplot2 the decision boundary can be plotted:

 col.plot +
   geom_abline(intercept = intercept, slope = slope, lwd = 1, col = "purple")

Probability predictions

The probability of the features indicating a positive output can be viewed as follows:

all.df$prob <- predict(log.mod, type = 'response')
head(all.df, 8)
##   X       x1       x2 y         prob
## 1 1 23.43869 20.09662 0 6.597062e-01
## 2 2 24.05539 17.37254 0 9.270866e-02
## 3 3 24.46357 16.14766 0 2.799435e-02
## 4 4 19.00521 14.08636 0 2.357297e-04
## 5 5 14.44262 17.69164 0 2.265144e-03
## 6 6 15.94049 11.98209 0 5.199846e-06
## 7 7 22.66409 16.17056 0 1.334191e-02
## 8 8 19.52361 10.95170 0 7.338503e-06

Model Visualisation

There are many ways to visualise the probability model trained by the data.

Side-On

From a side-on view the model looks like this:

##         x1      x2         pred
## 1 8.720764 8.61105 4.223595e-09
## 2 8.959804 8.61105 4.686591e-09
## 3 9.198843 8.61105 5.200341e-09
## 4 9.437883 8.61105 5.770409e-09
## 5 9.676922 8.61105 6.402968e-09
## 6 9.915962 8.61105 7.104870e-09
##   X       x1       x2 y         prob
## 1 1 23.43869 20.09662 0 6.597062e-01
## 2 2 24.05539 17.37254 0 9.270866e-02
## 3 3 24.46357 16.14766 0 2.799435e-02
## 4 4 19.00521 14.08636 0 2.357297e-04
## 5 5 14.44262 17.69164 0 2.265144e-03
## 6 6 15.94049 11.98209 0 5.199846e-06
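The grid of predictions shown above can be built with expand.grid() and predict(). A self-contained sketch follows, with simulated data standing in for the practical's data frame and illustrative grid ranges:

```r
# Simulated stand-in for all.df and log.mod (the practical's CSV is not bundled)
set.seed(1)
all.sim <- data.frame(x1 = runif(100, 8, 30), x2 = runif(100, 8, 25))
all.sim$y <- rbinom(100, 1, plogis(-33 + 0.44 * all.sim$x1 + 1.18 * all.sim$x2))
sim.mod  <- glm(y ~ x1 + x2, data = all.sim, family = "binomial")

# Regular grid over the predictor ranges, with a model prediction at each point
grid.df <- expand.grid(
  x1 = seq(min(all.sim$x1), max(all.sim$x1), length.out = 50),
  x2 = seq(min(all.sim$x2), max(all.sim$x2), length.out = 50)
)
grid.df$pred <- predict(sim.mod, newdata = grid.df, type = "response")

head(grid.df)
```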

3d Modelled Probability

The modelled probability of each observation exhibiting the output can be visualised as a 3d plot:

3d Surface Plot

A plot of the model as a 3d probability surface can also be visualised:

Base Graphics
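A sketch of what this might look like with persp(), again using a simulated stand-in for the fitted model (the practical's exact model and viewing angles are not reproduced):

```r
# Simulated stand-in model
set.seed(1)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- rbinom(200, 1, plogis(0.8 * d$x1 + 0.8 * d$x2))
m <- glm(y ~ x1 + x2, data = d, family = "binomial")

# Evaluate the fitted probability on a regular grid, then draw the surface
x1.seq <- seq(-3, 3, length.out = 40)
x2.seq <- seq(-3, 3, length.out = 40)
z <- outer(x1.seq, x2.seq,
           function(a, b) predict(m, newdata = data.frame(x1 = a, x2 = b),
                                  type = "response"))

persp(x1.seq, x2.seq, z, theta = 30, phi = 25,
      xlab = "Predictor 1", ylab = "Predictor 2", zlab = "P(y = 1)",
      ticktype = "detailed", col = "lightblue")
```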

Interactive Plotly

This is better viewed with an interactive Plotly graph:
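A minimal plot_ly() sketch, assuming a data frame with the same columns as the practical's (simulated values are used here so the snippet is self-contained):

```r
library(plotly)

# Simulated stand-in for all.df with its prob column
set.seed(1)
viz.df <- data.frame(x1 = runif(50, 8, 30), x2 = runif(50, 8, 25))
viz.df$y    <- factor(rbinom(50, 1, 0.5))
viz.df$prob <- runif(50)

# Interactive 3d scatter of modelled probability against the two predictors
p <- plot_ly(viz.df, x = ~x1, y = ~x2, z = ~prob,
             color = ~y, type = "scatter3d", mode = "markers")
p  # rotate and zoom in the viewer
```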